February 13, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
A deep neural network can be succinctly defined in two equations:
\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y_i)\]
\[\theta_i = g(\bb^T \varphi(\mathbf W_D^T \varphi(\mathbf W^T_{D - 1} \varphi(\mathbf W^T_{D - 2} ... \varphi(\mathbf W_1^T \mathbf x_i)))))\]
10 categories!
Loss:
\[-\expn \sum \limits_{k = 1}^K I(y_i = k) \log \text{Pr}(y_i = k)\]
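As a small numeric sketch (the probabilities here are made up, not from any fit below), the multiclass log loss is just the average of \(-\log\) of the probability assigned to each true class:

```python
import numpy as np

# Hypothetical predicted probabilities for N = 3 instances, K = 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
y = np.array([0, 2, 1])  # true class labels

# The indicator sum picks out log Pr(y_i = k) for the true class only
loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
print(loss)  # -(log 0.7 + log 0.6 + log 0.5) / 3, about 0.520
```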
Using a linear model:
\[\theta_{ik} = \sigma_k(\bb_k^T \mathbf x_i)\]
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score
# Create a Logistic Regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000, random_state=42)
# Train the model on X_train and y_train
model.fit(X_train, y_train)
# Predict the labels for X_train
y_pred = model.predict(X_train)
# Compute the training error rate
training_error_rate = 1 - accuracy_score(y_train, y_pred)
# Compute the training log loss
y_pred_proba = model.predict_proba(X_train)
training_log_loss = log_loss(y_train, y_pred_proba)
print(f'Training error rate: {training_error_rate}')
print(f'Training log loss: {training_log_loss}')
Training error rate: 0.04654999999999998
Training log loss: 0.17400512869479443
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, accuracy_score
# Create a Random Forest model with 100 trees
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model on X_train and y_train
model.fit(X_train, y_train)
# Predict the labels for X_train
y_pred = model.predict(X_train)
# Compute the training error rate
training_error_rate = 1 - accuracy_score(y_train, y_pred)
# Compute the training log loss
y_pred_proba = model.predict_proba(X_train)
training_log_loss = log_loss(y_train, y_pred_proba)
print(f'Training error rate: {training_error_rate}')
print(f'Training log loss: {training_log_loss}')
Training error rate: 0.0
Training log loss: 0.09439185597295903
Softmax NN w/ \(K\) categories:
Each hidden layer just has one set of coefficients.
\[\mathbf h_{ij} = \varphi(\mathbf W_j^T \mathbf h_{i(j-1)})\]
At the final layer (the output layer), create \(K\) sets of weights:
\[\theta_{ik} = \frac{\exp[\bb_k^T \mathbf h_{ij}]}{\sum \limits_{g = 1}^K \exp[\bb_g^T \mathbf h_{ij}]}\]
Each hidden layer is engineering a new feature set
The final feature set is used to train a multinomial logistic regression
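A minimal NumPy sketch of this forward pass (the sizes and weights here are made up for illustration): each hidden layer applies one weight matrix plus an activation, and the output layer is a softmax over \(K\) sets of weights.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical sizes: P = 4 inputs, 8 units per hidden layer, K = 3 classes
x = rng.normal(size=4)
W1 = rng.normal(size=(4, 8))   # first hidden layer: one set of coefficients
W2 = rng.normal(size=(8, 8))   # second hidden layer
B = rng.normal(size=(8, 3))    # output layer: one beta_k per class

h1 = relu(W1.T @ x)            # engineered feature set 1
h2 = relu(W2.T @ h1)           # engineered feature set 2
theta = softmax(B.T @ h2)      # multinomial logistic regression on h2
print(theta)                   # K class probabilities summing to 1
```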
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pandas as pd
from torchmetrics import Accuracy, MeanMetric
from pytorch_lightning.callbacks import RichProgressBar
class MNISTClassifier(pl.LightningModule):
def __init__(self, hidden_units: list = [256], lr: float = 1e-3):
"""
Args:
hidden_units (list): A list where each element specifies the number of units in that hidden layer.
lr (float): Initial learning rate.
"""
super(MNISTClassifier, self).__init__()
layers = []
input_dim = 784 # flattened image (28x28)
# Build hidden layers dynamically
for units in hidden_units:
layers.append(nn.Linear(input_dim, units))
layers.append(nn.ReLU())
input_dim = units
# Final output layer: maps last hidden layer to 10 classes
layers.append(nn.Linear(input_dim, 10))
self.model = nn.Sequential(*layers)
# Loss function (cross entropy = log loss)
self.criterion = nn.CrossEntropyLoss()
# Learning rate
self.lr = lr
def forward(self, x):
return self.model(x)
def training_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = self.criterion(logits, y)
preds = torch.argmax(logits, dim=1)
# Compute batch accuracy and then inaccuracy (1 - accuracy)
batch_accuracy = (preds == y).float().mean()
batch_inaccuracy = 1 - batch_accuracy
# Log both loss and inaccuracy on every training step
self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
self.log('train_inaccuracy', batch_inaccuracy, on_step=True, on_epoch=True, prog_bar=True)
return loss
def configure_optimizers(self):
# Use Adam optimizer.
optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
return optimizer
def predict_step(self, batch, batch_idx):
# Return a dict with:
# - probabilities: a list of probabilities per class,
# - predicted: a list of predicted class indices via argmax,
# - target: the correct class labels.
x, y = batch
logits = self(x)
probs = F.softmax(logits, dim=1)
preds = torch.argmax(logits, dim=1)
return {
"probabilities": probs.tolist(), # list of lists (one per instance)
"predicted": preds.tolist(), # list of predicted class indices
"target": y.tolist() # list of true labels
}

# -------------------- Data Preparation --------------------
# Load CSV data (images are already flattened to 784 features)
X_train = pd.read_csv('X_train_MNIST.csv').to_numpy() # shape: (N, 784)
y_train = pd.read_csv('y_train_MNIST.csv').to_numpy().flatten() # shape: (N,)
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train/255, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
# Create dataset and DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=10000, shuffle=True)

# -------------------- Model Initialization & Learning Rate Tuning --------------------
# For example, use one hidden layer with 128 units.
model = MNISTClassifier(hidden_units=[128], lr=.01)
# Create a Lightning Trainer with a rich progress bar callback.
trainer = pl.Trainer(max_epochs=100, callbacks=[RichProgressBar()])
# -------------------- Train the Model --------------------
trainer.fit(model, train_loader)
# -------------------- Compute Metrics Using the Predictions Dictionary --------------------
# Use Lightning's built-in predict method.
predictions = trainer.predict(model, dataloaders=train_loader)
# 'predictions' is a list of dictionaries (one per batch) with:
# - "probabilities": list of lists (each inner list holds class probabilities),
# - "predicted": list of predicted class indices,
# - "target": list of true labels.
# Aggregate predictions from all batches.
all_probs = []
all_preds = []
all_targets = []
for batch_pred in predictions:
all_probs.extend(batch_pred["probabilities"])
all_preds.extend(batch_pred["predicted"])
all_targets.extend(batch_pred["target"])
# Convert lists to tensors.
all_probs_tensor = torch.tensor(all_probs) # shape: (N, 10)
all_preds_tensor = torch.tensor(all_preds) # shape: (N,)
all_targets_tensor = torch.tensor(all_targets) # shape: (N,)
# Compute the cross entropy loss (log loss) for each instance:
# For each instance, loss = -log(probability of the correct class)
nll = -torch.log(all_probs_tensor[torch.arange(all_probs_tensor.size(0)), all_targets_tensor])
final_log_loss = nll.mean()
# Compute the inaccuracy rate.
accuracy = (all_preds_tensor == all_targets_tensor).float().mean()
final_inaccuracy_rate = 1 - accuracy
# -------------------- Print Final Metrics --------------------
print(f"Final Training Log Loss (Cross Entropy): {final_log_loss.item():.4f}")
print(f"Final Training Inaccuracy Rate: {final_inaccuracy_rate.item() * 100:.2f}%")
  Name        Type              Params   Mode
0 model       Sequential        101 K    train
1 criterion   CrossEntropyLoss  0        train
Trainable params: 101 K; Non-trainable params: 0; Total params: 101 K; Total estimated model params size (MB): 0
Final Training Log Loss (Cross Entropy): 0.0089
Final Training Inaccuracy Rate: 0.02%
# -------------------- Model Initialization & Learning Rate Tuning --------------------
# For example, use two hidden layers with 128 units each.
model = MNISTClassifier(hidden_units=[128,128], lr=.01)
# Create a Lightning Trainer with a rich progress bar callback.
trainer = pl.Trainer(max_epochs=100, callbacks=[RichProgressBar()])
# -------------------- Train the Model --------------------
trainer.fit(model, train_loader)
# -------------------- Compute Metrics Using the Predictions Dictionary --------------------
# Use Lightning's built-in predict method.
predictions = trainer.predict(model, dataloaders=train_loader)
# 'predictions' is a list of dictionaries (one per batch) with:
# - "probabilities": list of lists (each inner list holds class probabilities),
# - "predicted": list of predicted class indices,
# - "target": list of true labels.
# Aggregate predictions from all batches.
all_probs = []
all_preds = []
all_targets = []
for batch_pred in predictions:
all_probs.extend(batch_pred["probabilities"])
all_preds.extend(batch_pred["predicted"])
all_targets.extend(batch_pred["target"])
# Convert lists to tensors.
all_probs_tensor = torch.tensor(all_probs) # shape: (N, 10)
all_preds_tensor = torch.tensor(all_preds) # shape: (N,)
all_targets_tensor = torch.tensor(all_targets) # shape: (N,)
# Compute the cross entropy loss (log loss) for each instance:
# For each instance, loss = -log(probability of the correct class)
nll = -torch.log(all_probs_tensor[torch.arange(all_probs_tensor.size(0)), all_targets_tensor])
final_log_loss = nll.mean()
# Compute the inaccuracy rate.
accuracy = (all_preds_tensor == all_targets_tensor).float().mean()
final_inaccuracy_rate = 1 - accuracy
# -------------------- Print Final Metrics --------------------
print(f"Final Training Log Loss (Cross Entropy): {final_log_loss.item():.4f}")
print(f"Final Training Inaccuracy Rate: {final_inaccuracy_rate.item() * 100:.2f}%")
  Name        Type              Params   Mode
0 model       Sequential        118 K    train
1 criterion   CrossEntropyLoss  0        train
Trainable params: 118 K; Non-trainable params: 0; Total params: 118 K; Total estimated model params size (MB): 0
Final Training Log Loss (Cross Entropy): 0.0009
Final Training Inaccuracy Rate: 0.00%
# -------------------- Model Initialization & Learning Rate Tuning --------------------
# For example, use three hidden layers with 128 units each.
model = MNISTClassifier(hidden_units=[128,128,128], lr=.01)
# Create a Lightning Trainer with a rich progress bar callback.
trainer = pl.Trainer(max_epochs=100, callbacks=[RichProgressBar()])
# -------------------- Train the Model --------------------
trainer.fit(model, train_loader)
# -------------------- Compute Metrics Using the Predictions Dictionary --------------------
# Use Lightning's built-in predict method.
predictions = trainer.predict(model, dataloaders=train_loader)
# 'predictions' is a list of dictionaries (one per batch) with:
# - "probabilities": list of lists (each inner list holds class probabilities),
# - "predicted": list of predicted class indices,
# - "target": list of true labels.
# Aggregate predictions from all batches.
all_probs = []
all_preds = []
all_targets = []
for batch_pred in predictions:
all_probs.extend(batch_pred["probabilities"])
all_preds.extend(batch_pred["predicted"])
all_targets.extend(batch_pred["target"])
# Convert lists to tensors.
all_probs_tensor = torch.tensor(all_probs) # shape: (N, 10)
all_preds_tensor = torch.tensor(all_preds) # shape: (N,)
all_targets_tensor = torch.tensor(all_targets) # shape: (N,)
# Compute the cross entropy loss (log loss) for each instance:
# For each instance, loss = -log(probability of the correct class)
nll = -torch.log(all_probs_tensor[torch.arange(all_probs_tensor.size(0)), all_targets_tensor])
final_log_loss = nll.mean()
# Compute the inaccuracy rate.
accuracy = (all_preds_tensor == all_targets_tensor).float().mean()
final_inaccuracy_rate = 1 - accuracy
# -------------------- Print Final Metrics --------------------
print(f"Final Training Log Loss (Cross Entropy): {final_log_loss.item():.4f}")
print(f"Final Training Inaccuracy Rate: {final_inaccuracy_rate.item() * 100:.2f}%")
  Name        Type              Params   Mode
0 model       Sequential        134 K    train
1 criterion   CrossEntropyLoss  0        train
Trainable params: 134 K; Non-trainable params: 0; Total params: 134 K; Total estimated model params size (MB): 0
Final Training Log Loss (Cross Entropy): 0.0002
Final Training Inaccuracy Rate: 0.00%
With enough hidden units and layers, we can interpolate even a complicated training set arbitrarily well
Note that the random forest performed just as well for this tabular data
This is generally going to be true when the training data is a single outcome with a table of features!
And the random forest will usually generalize better on tabular data like this
The real magic of this is that we were able to construct a 3 layer NN with ReLU activations, a softmax output layer, and cross entropy loss in roughly 10 lines of code!
It just took in the model instructions and trained our specific neural network with very little additional input needed
PyTorch is quite verbose in its setup, but this has more to do with how we train the model
The actual model specification is all contained in that init step.
This isn’t a trivial task
Specify the model
Compute the gradients
Set up the ADAM optimizer
Run to convergence
PyTorch is modular
I can add a new layer with any activation function easily with one new line of code
It just knows what to do!
The question is: how is it able to train models with so little code?
The method used by modern NN architectures is called autodiff
Specify the model as a computational graph that maps inputs to outputs through unknowns and simple primitive operations
For each node in the graph, compute the simple local gradient with respect to knowns and unknowns
Use backpropagation to combine the local gradients in an efficient way to get the gradient of the loss w.r.t. any local unknowns
Use SGD to minimize training loss w.r.t. the inputs and loss structure!
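The four steps above are exactly what `torch.autograd` does; a minimal sketch with one unknown:

```python
import torch

# Specify the model as a computational graph with one unknown
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x          # built from primitive ops (pow, mul, add)

# Backpropagation combines the local gradients automatically
y.backward()
print(x.grad)           # dy/dx = 2x + 3 = 7 at x = 2
```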
Computational Graph for Logistic Regression:
Two primitive operations:
Multiplication
Addition
One complex operation:
The sigmoid function
Two unknowns:
\(\mathbf W\) - a \(2 \times 1\) vector of coefficients
\(b\) - a single bias term
Operations:
Cross multiply X and W to get q
Add b to q to get z
Compute the sigmoid function for z
Add it all together to get the loss (Cross entropy)
A collection of simple operations arranged in an acyclic graph
Gradient descent:
Start by setting initial values for the unknowns (W and b)
Use current values to compute intermediate values (q and z)
Evaluate the gradient at the current values
Update the unknowns
Repeat a lot until convergence
The process of using the current values of the unknowns to set the intermediates is referred to as the forward pass
From the bottom of the graph to the top, fill in any values!
Moving through the graph in the forward direction.
The next step is to find the gradient of the loss w.r.t. the unknowns.
Two unknowns:
\[\frac{\partial \mathcal L}{\partial W} \text{ ; } \frac{\partial \mathcal L}{\partial b}\]
We know that we can get these gradients using the chain rule
\[\frac{\partial \mathcal L}{\partial W} = \frac{\partial \mathcal L}{\partial z}\frac{\partial z}{\partial q}\frac{\partial q}{\partial W}\]
\[\frac{\partial \mathcal L}{\partial b} = \frac{\partial \mathcal L}{\partial z}\frac{\partial z}{\partial b}\]
For logistic regression, the loss layer is cross-entropy:
\[\mathcal L(z | \mathbf X , \mathbf y) = -\expn \left[ y_i \log \sigma(z_i) + (1 - y_i)\log(1 - \sigma(z_i)) \right]\]
To make this easier, we’re just going to look at a single instance and note that the gradients at each step are an average over the instances.
First step, find the derivative of the loss w.r.t. \(z_i\).
We’ve seen this a lot, so we won’t re-derive. The gradient for a cross-entropy layer with 2 categories is always a scalar:
\[\frac{\partial \mathcal L_i}{\partial z_i} = \sigma(z_i) - y_i\]
where \(\sigma(z_i)\) is:
\[\frac{1}{1 + \exp[-z_i]}\]
Let’s move down the chain for \(\mathbf W\). Next, we want to find:
\[\frac{\partial \mathcal L_i}{\partial q_i} = \frac{\partial \mathcal L_i}{\partial z_i}\frac{\partial z_i}{\partial q_i}\]
\[z_i = q_i + b\]
The derivative here evaluates to 1, so:
\[\frac{\partial \mathcal L_i}{\partial q_i} = (\sigma(z_i) - y_i) \times 1\]
We also have the bias term in this step! So, we can find the gradient w.r.t. one of our unknowns:
\[\frac{\partial \mathcal L_i}{\partial b} = \frac{\partial \mathcal L_i}{\partial z_i}\frac{\partial z_i}{\partial b} = (\sigma(z_i) - y_i) \times 1\]
\[\frac{\partial \mathcal L}{\partial b} = \expn \frac{\partial \mathcal L_i}{\partial z_i}\frac{\partial z_i}{\partial b} = \expn (\sigma(z_i) - y_i)\]
Finally, we have
\[\frac{\partial \mathcal L_i}{\partial W} = \frac{\partial \mathcal L_i}{\partial q_i} \frac{\partial q_i}{\partial W}\]
\(\mathbf W\) is a \(P\) vector and \(q_i\) is a scalar.
\(\frac{\partial q_i}{\partial W}\) is a P-vector gradient
Using the rules of vector derivatives, we have:
\[q_i = \mathbf W^T \mathbf x_i\]
\[\frac{\partial q_i}{\partial W} = \mathbf x_i\]
\[\frac{\partial \mathcal L_i}{\partial W} = (\sigma(z_i) - y_i)\mathbf x_i\]
\[\frac{\partial \mathcal L}{\partial W} = \expn (\sigma(z_i) - y_i)\mathbf x_i\]
which we recognize as the gradient for the coefficients in logistic regression!
This is no different than what we’ve done before!
Algorithmically, even.
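We can check the derived forms \(\partial\mathcal L/\partial b = \sigma(z_i) - y_i\) and \(\partial\mathcal L/\partial \mathbf W = (\sigma(z_i) - y_i)\mathbf x_i\) against autodiff on a single instance (the values here are made up for illustration):

```python
import torch

W = torch.tensor([0.5, -0.3], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x = torch.tensor([1.0, 2.0])
y = torch.tensor(1.0)

q = W @ x                        # multiplication node
z = q + b                        # addition node
p = torch.sigmoid(z)             # sigmoid node
loss = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))  # cross entropy
loss.backward()

resid = (p - y).detach()         # sigma(z) - y
assert torch.allclose(b.grad, resid)      # dL/db = sigma(z) - y
assert torch.allclose(W.grad, resid * x)  # dL/dW = (sigma(z) - y) x
```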
Backpropagation is an autodiff algorithm that has two steps:
Forward pass: set the values of any intermediate variables given the unknowns
Backwards pass: Compute the gradient given the intermediate values
Most of the hard work is done in the backwards pass!
Leverage the chain rule:
\[\text{Downstream Grad} = \text{Local Grad} \times \text{Upstream Grad}\]
\[\frac{\partial \mathcal L}{\partial x} = \frac{\partial \mathcal L}{\partial q} \frac{\partial q}{\partial x}\]
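A one-line numeric illustration of downstream = local \(\times\) upstream:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
q = 2.0 * x       # local grad: dq/dx = 2
L = q ** 2        # upstream grad: dL/dq = 2q = 12
L.backward()
print(x.grad)     # downstream = local * upstream = 2 * 12 = 24
```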
Now, let’s think about setting up a graph for a two-layer NN with \(K_1\) hidden units in the first layer and \(K_2\) hidden units in the second layer.
\[\mathcal L(\mathbf z | \mathbf X, \mathbf y) = -\expn \left[ y_i \log \sigma(z_i) + (1 - y_i)\log(1 - \sigma(z_i)) \right]\]
\[\underset{(1 \times 1)}{z_i} = \underset{(1 \times K_2)}{\bb^T} \underset{(K_2 \times 1)}{\mathbf h_{i2}} + \underset{(1 \times 1)}{a}\]
\[\underset{(K_2 \times 1)}{\mathbf h_{i2}} = \varphi(\underset{(K_2 \times 1)}{\mathbf q_{i2}})\]
\[\underset{(K_2 \times 1)}{\mathbf q_{i2}} = \underset{(K_2 \times K_1)}{\mathbf W_2^T}\underset{(K_1 \times 1)}{\mathbf h_{i1}} + \underset{(\mathbf K_2 \times 1)}{\mathbf b_2}\]
\[\underset{(K_1 \times 1)}{\mathbf h_{i1}} = \varphi(\underset{(K_1 \times 1)}{\mathbf q_{i1}})\]
\[\underset{(K_1 \times 1)}{\mathbf q_{i1}} = \underset{(K_1 \times P)}{\mathbf W_1^T}\underset{(P \times 1)}{\mathbf x_i} + \underset{(\mathbf K_1 \times 1)}{\mathbf b_1}\]
Unknowns:
Intercepts: \(a\), \(\mathbf b_2\), and \(\mathbf b_1\)
Weights: \(\boldsymbol \beta\), \(\mathbf W_2\), \(\mathbf W_1\)
Gradients to compute:
\(\frac{\partial \mathcal L_i}{\partial a}\), \(\frac{\partial \mathcal L_i}{\partial \mathbf b_2}\), \(\frac{\partial \mathcal L_i}{\partial \mathbf b_1}\), \(\frac{\partial \mathcal L_i}{\partial \boldsymbol \beta}\), \(\frac{\partial \mathcal L_i}{\partial \mathbf W_2}\), \(\frac{\partial \mathcal L_i}{\partial \mathbf W_1}\)
Starting at the top:
\[\frac{\partial \mathcal L_i}{\partial z_i} = \underset{(1 \times 1)}{\sigma(z_i) - y_i}\]
\[\underset{(1 \times 1)}{z_i} = \underset{(1 \times K_2)}{\bb^T} \underset{(K_2 \times 1)}{\mathbf h_{i2}} + \underset{(1 \times 1)}{a}\]
For \(z_i\), three derivatives:
\(\frac{\partial z_i}{\partial \boldsymbol \beta} = \underset{(1 \times K_2)}{\mathbf h_{i2}^T}\)
\(\frac{\partial z_i}{\partial a} = \underset{(1 \times 1)}{1}\)
\(\frac{\partial z_i}{\partial \mathbf h_{i2}} = \underset{(1 \times K_2)}{\bb^T}\)
At this point, three chains of derivatives:
\[\frac{\partial \mathcal L_i}{\partial \boldsymbol \beta} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\mathbf h_{i2}^T}\]
\[\frac{\partial \mathcal L_i}{\partial a} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times 1)}{1}\]
\[\frac{\partial \mathcal L_i}{\partial \mathbf h_{i2}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T}\]
The chains for \(\boldsymbol \beta\) and \(a\) are done!
2 out of 6 complete!
\[\underset{(K_2 \times 1)}{\mathbf h_{i2}} = \varphi(\underset{(K_2 \times 1)}{\mathbf q_{i2}})\]
We want to find the derivative of a vector w.r.t. a vector.
Vector derivatives w.r.t a vector yield a Jacobian
\[\mathbf h_{i2} \in \mathbb R^{{K_2}} \text{ ; } \mathbf q_{i2} \in \mathbb R^{{K_2}}\]
\[\mathbf J_{jk} = \frac{\partial h_{i2,k}}{\partial q_{i2,j}}\]
A small change in element \(j\) of \(\mathbf q\), how much does element \(k\) of \(\mathbf h\) change?
Yields a \(K_2 \times K_2\) matrix of derivatives!
Shortcut:
Recall that the ReLU activation function is applied elementwise
If \(j \neq k\), does a change in \(q\) result in a change in \(h\)?
\[\underset{(K_2 \times 1)}{\mathbf h_{i2}} = \varphi(\underset{(K_2 \times 1)}{\mathbf q_{i2}})\]
Nope! Only a change in the corresponding element results in a change!
Diagonals:
(Just pretend this is never equal to zero…)
\[ \frac{\partial \mathbf h_{i2}}{\partial \mathbf q_{i2}} = \begin{bmatrix} I(q_{i2,1} > 0) & 0 & \ldots & 0\\ 0 & I(q_{i2,2} > 0) & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & I(q_{i2,K_2} > 0)\\ \end{bmatrix} = \text{diag}(I(\mathbf q_{i2} > 0)) \]
Since it’s a diagonal matrix, we can think of applying this derivative as an elementwise operation:
\[\frac{\partial \mathcal L_i}{\partial \mathbf q_{i2}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)}\]
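The diagonal structure of the elementwise ReLU Jacobian is easy to verify numerically (a small made-up \(\mathbf q\)):

```python
import torch
from torch.autograd.functional import jacobian

q = torch.tensor([0.5, -1.2, 2.0, -0.3])
J = jacobian(torch.relu, q)   # (K x K) Jacobian of the elementwise ReLU

# Off-diagonals are zero; the diagonal is the indicator I(q > 0)
assert torch.equal(J, torch.diag((q > 0).float()))
print(J.diagonal())           # 1 where q > 0, else 0
```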
\[\underset{(K_2 \times 1)}{\mathbf q_{i2}} = \underset{(K_2 \times K_1)}{\mathbf W_2}\underset{(K_1 \times 1)}{\mathbf h_{i1}} + \underset{(\mathbf K_2 \times 1)}{\mathbf b_2}\]
Three derivatives here:
\(\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_2}\) (Terminal)
\(\frac{\partial \mathbf q_{i2}}{\partial \mathbf h_{i1}}\)
\(\frac{\partial \mathbf q_{i2}}{\partial \mathbf b_2}\) (Terminal)
These are by far the trickiest step (but will always end up being the same thing)
Let’s start with the easiest one - \(\frac{\partial \mathbf q_{i2}}{\partial \mathbf b_2}\)
Each one is a \(K_2\) vector
As with elementwise ReLU, the \(k^{th}\) element of \(\mathbf b_2\) only changes the \(k^{th}\) element of \(\mathbf q_{i2}\)
Vector-vector derivative yields a \(K_2 \times K_2\) Jacobian, so:
\[ \frac{\partial \mathbf q_{i2}}{\partial \mathbf b_2} = \begin{bmatrix} 1 & 0 & \ldots & 0\\ 0 & 1 & \ldots & 0\\ \vdots & \vdots & \ddots & 0\\ 0 & 0 & \ldots & 1\\ \end{bmatrix} \]
\[\frac{\partial \mathcal L_i}{\partial \mathbf b_2} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_2)}{\mathcal I_{K_2}}\]
Now, let’s work on \(\frac{\partial \mathbf q_{i2}}{\partial \mathbf h_{i1}}\)
\[\underset{(K_2 \times 1)}{\mathbf q_{i2}} = \underset{(K_2 \times K_1)}{\mathbf W_2}\underset{(K_1 \times 1)}{\mathbf h_{i1}} + \underset{(\mathbf K_2 \times 1)}{\mathbf b_2}\]
A general matrix derivative rule:
\[\frac{d}{d \mathbf x} \mathbf A \mathbf x = \mathbf A\]
\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf h_{i1}} = \mathbf W_2\]
The chain:
\[\frac{\partial \mathcal L_i}{\partial \mathbf h_{i1}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2}\]
We can check our work by thinking about the dimensionality of the Jacobian
\(\mathbf h_{i1}\) will be a vector with \(K_1\) values
The derivative of the loss w.r.t. \(\mathbf h_{i1}\) is of length \(K_1\)
Now the hard one:
\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_{2}}\]
What makes this tricky is the size of the Jacobian:
\(\mathbf q_{i2}\) is a \(K_2 \times 1\) vector and \(\mathbf W_2\) is a \(K_2 \times K_1\) matrix.
This will yield a Jacobian of size \(K_2 \times K_2 \times K_1\) - a 3D tensor
We know this because the resulting gradient needs to be of size \(K_2 \times K_1\) - how the loss changes as a function of each element of the weight matrix!
We don’t want to deal with this.
If \(K_1 = 128\) and \(K_2 = 128\), then this tensor holds \(128^3 \approx 2.1\) million entries - roughly 8.4 MB in single precision
As these models get larger, we really don’t want to be carrying around many matrices of that size…
Also really hard to think about!!!!!
A trick:
Let’s think about \(\mathbf q_{i2}\) in sum notation
\[\mathbf q_{i2} = \left[\sum \limits_{k = 1}^{K_1} \mathbf W_{1k} h_{k}, \sum \limits_{k = 1}^{K_1} \mathbf W_{2k} h_{k},...,\sum \limits_{k = 1}^{K_1} \mathbf W_{K_2k} h_{k} \right]\]
Let’s think about taking the derivative w.r.t. one element of \(\mathbf W\) - \(\mathbf W_{1,1}\)
\[q_1 = \mathbf W_{1,1} h_1 + \mathbf W_{1,2} h_2 + ... + \mathbf W_{1,K_1} h_{K_1}\]
\[q_2 = \mathbf W_{2,1} h_1 + \mathbf W_{2,2} h_2 + ... + \mathbf W_{2,K_1} h_{K_1}\]
\(\mathbf W_{1,1}\) only shows up once!
Now, let’s think of the partial of \(\mathbf q_{i2}\) w.r.t. \(\mathbf W_{1,1}\)
\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_{1,1}} = \left[h_{1}, 0 , 0 , 0 , ... , 0 \right]^T\]
\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_{1,2}} = \left[h_{2}, 0 , 0 , 0 , ... , 0 \right]^T\]
\[\frac{\partial \mathbf q_{i2}}{\partial \mathbf W_{2,1}} = \left[0, h_1 , 0 , 0 , ... , 0 \right]^T\]
Generic rule:
\[\mathbf v_{r,c}^T = \frac{\partial \mathbf q_{i}}{\partial \mathbf W_{r,c}} = \left[\underset{1}{0}, \underset{2}{0} , ... , \underset{r}{h_c} , ... , \underset{K_2}{0} \right]^T\]
Now, let’s grab the gradient w.r.t. the loss
\[\frac{\partial \mathcal L_i}{\partial \mathbf W_{2: r,c}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times 1)}{v_{r,c}}\]
Let \(\mathbf u\) represent the upstream gradient:
\[\underset{(1 \times K_2)}{\mathbf u} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)}\]
\[\underset{K_2 \times 1}{\mathbf v_{r,c}} = \frac{\partial \mathbf q_{i}}{\partial \mathbf W_{r,c}} = \left[\underset{1}{0}, \underset{2}{0} , ... , \underset{r}{h_c} , ... , \underset{K_2}{0} \right]^T\]
\[\frac{\partial \mathcal L_i}{\partial \mathbf W_{2: r,c}} = \mathbf u \mathbf v_{r,c} = u_r h_c\]
\[ \frac{\partial \mathcal L_i}{\partial \mathbf W_{2}} = \begin{bmatrix} u_1 h_1 & u_1 h_2 & \ldots & u_1 h_{K_1}\\ u_2 h_1 & u_2 h_2 & \ldots & u_2 h_{K_1}\\ \vdots & \vdots & \ddots & \vdots \\ u_{K_2} h_1 & u_{K_2} h_2 & \ldots & u_{K_2} h_{K_1}\\ \end{bmatrix} \]
Put more succinctly, we can define the gradient as:
\[ \frac{\partial \mathcal L_i}{\partial \mathbf W_{2}} = \underset{(K_2 \times 1)}{\mathbf u^T} \times \underset{(1 \times K_1)}{\mathbf h_{i1}^T} \]
All that work to get something workable!
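The outer-product form is easy to sanity-check with autograd: inject a made-up upstream gradient \(\mathbf u\) into a linear layer and compare (the sizes here are illustrative):

```python
import torch

torch.manual_seed(0)
K1, K2 = 3, 4
W2 = torch.randn(K2, K1, requires_grad=True)  # the (K2 x K1) weight matrix
h1 = torch.randn(K1)                          # previous layer's features
u = torch.randn(K2)                           # stand-in upstream gradient

q2 = W2 @ h1            # linear layer output, shape (K2,)
q2.backward(u)          # backprop the injected upstream gradient u

# Gradient w.r.t. W2 is the outer product of u and h1
assert torch.allclose(W2.grad, torch.outer(u, h1))
```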
4 out of 6 gradients derived!
2 more to go…
Unfortunately, we’ve still got work to do!
Or do we?
\[\mathcal L(\mathbf z | \mathbf X, \mathbf y) = -\expn y_i \log \sigma(z_i) + (1 - y_i)(1 - \sigma(z_i))\]
\[\underset{(1 \times 1)}{z_i} = \underset{(1 \times K_2)}{\bb^T} \underset{(K_2 \times 1)}{\mathbf h_{i2}} + \underset{(1 \times 1)}{a}\]
\[\underset{(K_2 \times 1)}{\mathbf h_{i2}} = \varphi(\underset{(K_2 \times 1)}{\mathbf q_{i2}})\]
\[\underset{(K_2 \times 1)}{\mathbf q_{i2}} = \underset{(K_2 \times K_1)}{\mathbf W_2^T}\underset{(K_1 \times 1)}{\mathbf h_{i1}} + \underset{(\mathbf K_2 \times 1)}{\mathbf b_2}\]
\[\underset{(K_1 \times 1)}{\mathbf h_{i1}} = \varphi(\underset{(K_1 \times 1)}{\mathbf q_{i1}})\]
\[\underset{(K_1 \times 1)}{\mathbf q_{i1}} = \underset{(K_1 \times P)}{\mathbf W_1^T}\underset{(P \times 1)}{\mathbf x_i} + \underset{(\mathbf K_1 \times 1)}{\mathbf b_1}\]
At this point, we have \(\frac{\partial \mathcal L_i}{\partial \mathbf h_{i1}}\):
\[\frac{\partial \mathcal L_i}{\partial \mathbf h_{i1}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2}\]
Next is \(\frac{\partial \mathbf h_{i1}}{\partial \mathbf q_{i1}}\).
\[\underset{(K_1 \times 1)}{\mathbf h_{i1}} = \varphi(\underset{(K_1 \times 1)}{\mathbf q_{i1}})\]
We already know what this is going to be!
A diagonal matrix with \(I(\mathbf q_{i1} > 0)\) along the diagonal
\[\frac{\partial \mathcal L_i}{\partial \mathbf q_{i1}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)}\]
One layer to go!
\[\underset{(K_1 \times 1)}{\mathbf q_{i1}} = \underset{(K_1 \times P)}{\mathbf W_1^T}\underset{(P \times 1)}{\mathbf x_i} + \underset{(\mathbf K_1 \times 1)}{\mathbf b_1}\]
Three derivatives here:
\(\frac{\partial \mathbf q_{i1}}{\partial \mathbf W_1}\) (Terminal)
\(\frac{\partial \mathbf q_{i1}}{\partial \mathbf x_i}\) (Terminal)
\(\frac{\partial \mathbf q_{i1}}{\partial \mathbf b_1}\) (Terminal)
We’ve seen these derivative forms before…
\[ \frac{\partial \mathbf q_{i1}}{\partial \mathbf b_1} = \begin{bmatrix} 1 & 0 & \ldots & 0\\ 0 & 1 & \ldots & 0\\ \vdots & \vdots & \ddots & 0\\ 0 & 0 & \ldots & 1\\ \end{bmatrix} \]
\[\frac{\partial \mathcal L_i}{\partial \mathbf b_{1}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)}\]
\[\underset{(K_1 \times 1)}{\mathbf q_{i1}} = \underset{(K_1 \times P)}{\mathbf W_1^T}\underset{(P \times 1)}{\mathbf x_i} + \underset{(\mathbf K_1 \times 1)}{\mathbf b_1}\]
Derivative w.r.t. \(\mathbf x_i\) uses matrix derivative rule
\[\frac{\partial \mathcal L_i}{\partial \mathbf x_{i}} = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)} \otimes \underset{(K_1 \times P)}{\mathbf W_1}\]
Gradient of the loss w.r.t. \(\mathbf W_1\) uses the outer product rule for linear layers:
\[ \mathbf u = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)} \]
\[ \frac{\partial \mathcal L_i}{\partial \mathbf W_1} = \mathbf u^T \times \mathbf x_{i}^T \]
We have all of the rules!
As long as we know what kind of layer we’re computing, we know that the gradient/Jacobian has a known form!
PyTorch is just an autodiff machine
Interpret the user-defined model and draw the corresponding computational graph
Find the appropriate forward pass functions
Find the appropriate backwards pass gradient functions
On a backwards pass, multiply the gradients/Jacobians to get the SGD update step
This software just optimizes the differentiation routine for fitting models via chained gradients!
For any neural network, you could program up your own function that does forward and backwards passes
At scale, though, your function is going to be walking through mud compared to PyTorch!
Blazing fast
Compiled C++ core
A lot of very nice utility functions
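To see that hand-rolling the passes really is possible, here is a minimal numpy sketch of a forward and backward pass for the two-layer ReLU network from the slides, with made-up sizes (\(P = 4\), \(K_1 = 5\), \(K_2 = 3\)) and random data. The analytic gradient for one entry of \(\mathbf W_1\) is checked against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
P, K1, K2 = 4, 5, 3

# Made-up parameters and a single observation
W1 = rng.normal(size=(P, K1))
W2 = rng.normal(size=(K1, K2))
beta = rng.normal(size=K2)
x = rng.normal(size=P)
y = 1.0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(W1):
    # Forward pass: two ReLU layers, sigmoid output, binary cross-entropy
    h1 = np.maximum(W1.T @ x, 0)
    h2 = np.maximum(W2.T @ h1, 0)
    p = sigmoid(beta @ h2)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Backward pass, chaining the Jacobians exactly as on the slides
q1 = W1.T @ x
h1 = np.maximum(q1, 0)
q2 = W2.T @ h1
h2 = np.maximum(q2, 0)
z = beta @ h2
u = (sigmoid(z) - y) * beta * (q2 > 0)   # (1 x K2)
u = (u @ W2.T) * (q1 > 0)                # (1 x K1)
grad_W1 = np.outer(x, u)                 # dL/dW1, stored as (P x K1)

# Central finite-difference check on one entry of W1
eps = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
fd = (loss(W1_plus) - loss(W1_minus)) / (2 * eps)
print(fd, grad_W1[0, 0])   # the two values should agree closely
```

The hand-rolled version works, but every new architecture would need its own backward pass; PyTorch assembles the same chain automatically from the forward graph.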
With this setup, we can train arbitrarily complex neural networks
Large width
Large depth
Why don’t we do this all the time?
\[ \mathbf u = \underset{(1 \times 1)}{[\sigma(z_i) - y_i]} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_{i2} > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2} \odot \underset{(1 \times K_1)}{I(\mathbf q_{i1} > 0)} \]
\[ \frac{\partial \mathcal L_i}{\partial \mathbf W_1} = \mathbf u^T \times \mathbf x_{i}^T \]
Note that the gradient for the loss w.r.t. \(\mathbf W_1\) depends on the values of \(\mathbf W_2\)
If we extend this out to a NN of arbitrary depth, we get a function of the form:
\[ \mathbf u = [\sigma(z_i) - y_i] \otimes \bb^T \odot I(\mathbf q_{iD} > 0) \otimes \mathbf W_{D} \odot I(\mathbf q_{i(D - 1)} > 0) \otimes \mathbf W_{D-1} \cdots \odot I(\mathbf q_{i1} > 0) \otimes \mathbf W_{1} \]
Because the gradient of the weights at a layer is a function of a product of all of the previous weights, we can run into issues that arise due to chains of multiplication
Let \(\epsilon\) be a positive value
\[(1 + \epsilon)^D \rightarrow \infty\]
\[(1 - \epsilon)^D \rightarrow 0\]
The key part is that the more depth we have, the smaller \(\epsilon\) can be while still causing the gradient to explode or vanish
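A quick numeric illustration (the value of \(\epsilon\) and the depths are arbitrary):

```python
# Chains of multiplication compound quickly with depth D
eps = 0.1
for D in (10, 100, 1000):
    print(D, (1 + eps) ** D, (1 - eps) ** D)
```

At \(D = 1000\), a factor only 10% above 1 has exploded past \(10^{40}\), while a factor 10% below 1 has vanished below \(10^{-40}\).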
Gradient explosion is a problem that arises when we have a lot of weights (after passing through the nonlinear activation) that are generally greater than 1.
We shoot way off into the distance and will never return…
Two times we see this happen (most frequently):
Really deep models with a lot of multiplication of weights
Bad starting values
Next time, we’ll talk about residual connections which can help to mitigate the first one
One easy way to fix the exploding gradient problem is to use a method called gradient clipping
The standard SGD step:
\[\boldsymbol \theta_t = \boldsymbol \theta_{t - 1} - \eta \mathbf g(\theta_{t - 1})\]
where the gradient sets the optimal direction of descent
Note that the gradient is a vector in the loss space that sets the direction
The size is also set by the gradient to try to maximize the “benefit” of a single step
As long as we’re moving in the right direction, we can temper the size and be okay
Gradient clipping caps the amount with which we can move in a single step:
\[\boldsymbol \theta_t = \boldsymbol \theta_{t - 1} - \eta \mathbf g'(\theta_{t - 1})\]
\[\mathbf g'(\theta_{t - 1}) = \min\left(1 , \frac{c}{\|\mathbf g(\theta_{t - 1})\|} \right)\mathbf g(\theta_{t - 1})\]
where \(c\) is some positive value.
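The clipping rule above is a one-liner; here is a minimal numpy sketch (the gradient vectors are made up for illustration). PyTorch provides the same operation built in as `torch.nn.utils.clip_grad_norm_`:

```python
import numpy as np

def clip_gradient(g, c):
    """Rescale g so its norm is at most c; leave it alone otherwise."""
    norm = np.linalg.norm(g)
    if norm == 0:
        return g
    return min(1.0, c / norm) * g

g = np.array([3.0, 4.0])           # norm 5: gets rescaled down to norm c
print(np.linalg.norm(clip_gradient(g, c=1.0)))

g_small = np.array([0.3, 0.4])     # norm 0.5 < c: passes through unchanged
print(np.linalg.norm(clip_gradient(g_small, c=1.0)))
```

Note that clipping only shrinks the step size; the direction of descent is preserved.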
Additionally, try to ensure that the input features are normalized to exist on a small subset of the real line close to zero
Standardize the variance
Scale to be between 0 and 1
Helps to ensure that the corresponding coefficients are small.
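Both options are a couple of lines in numpy (the raw feature matrix here is made up; in practice sklearn's `StandardScaler` or `MinMaxScaler` do the same thing):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(50, 200, size=(100, 3))   # made-up raw features

# Option 1: standardize each column to mean 0, variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Option 2: min-max scale each column to [0, 1]
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
print(X_01.min(axis=0), X_01.max(axis=0))
```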
Another problem arises when we’re choosing starting values.
Choosing starting values for unknowns is a required part of GD routines.
A common way to initialize the procedure is to set each value of the hidden weights matrix to a random independent draw from a normal distribution.
Let’s assume that each column of \(\mathbf X\) has mean 0 and variance 1, with \(\mathbf x_i \in \mathbb R^P\)
Randomly draw each value of \(\mathbf W_1\) from a standard normal, as well.
Given an initial \(\mathbf W_1\) (a \(P \times K_1\) matrix), we can compute the set of \(K_1\) hidden values for an observation as:
\[\mathbf h_{i1} = \mathbf W_1^T \mathbf x_i\]
[Figure: distribution of the hidden values with 512 hidden units]
The variance of the values is blowing up!
For a normalized feature set, the variance of the starting values for the hidden units is:
\[P \text{ or } N_{In}\]
as the hidden unit is a sum over the initial weight value.
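We can verify this variance claim numerically; the sizes \(N\), \(P\), and \(K_1\) below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, K1 = 10_000, 64, 512          # arbitrary sizes for illustration

X = rng.normal(size=(N, P))         # standardized features
W1 = rng.normal(size=(P, K1))       # standard-normal starting weights
H = X @ W1                          # pre-activation hidden values

print(H.var())                      # close to P = 64, not 1
```

Each hidden value is a sum of \(P\) products of independent standard normals, so its variance is \(P\) rather than 1.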
On the other side, the variance of the gradient is going to explode as well depending on the number of outgoing connections. If \(\mathbf h_1\) connects to \(\mathbf h_2\) via:
\[\mathbf h_2 = \mathbf W_2^T \mathbf h_1\]
then the gradient of the loss is a function of the outgoing layer weights:
\[\frac{\partial \mathcal L_i}{\partial \mathbf h_{id}} = \mathbf u_i \mathbf W_{d + 1}^T\]
where \(\mathbf u_i\) is a \(1 \times K_{d + 1}\) vector and \(\mathbf W_{d + 1}^T\) is a \(K_{d + 1} \times K_d\) matrix.
[Figure: distribution of the gradients with 512 hidden units]
A smart initialization scheme is Glorot initialization (also known as Xavier initialization)
Given the number of incoming and outgoing connections affected by \(\mathbf W_k\), initialize the matrix to be random draws from a normal distribution with mean 0 and variance:
\[\sigma^2_{W_k} = \frac{2}{N_{In} + N_{Out}}\]
which drags down the initial variance of the hidden values!
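Repeating the earlier experiment with Glorot-scaled weights (same arbitrary sizes) shows the effect:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P, K1 = 10_000, 64, 512

X = rng.normal(size=(N, P))         # standardized features

# Glorot initialization: draws with variance 2 / (N_in + N_out)
sigma = np.sqrt(2 / (P + K1))
W1 = sigma * rng.normal(size=(P, K1))
H = X @ W1

print(H.var())                      # about 2P / (P + K1) ~ 0.22: no blow-up
```

The variance of the hidden values is now \(2P / (P + K_1)\) instead of \(P\), keeping both the forward activations and the backward gradients on a reasonable scale.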
On the other side, we also have the issue of vanishing gradients
The gradients for different parameters get stuck at small values due to architectural choices
We can’t escape from the suboptimal part of the loss space
These are a little harder to deal with
We’ll start here next time!